Collecting Bilingual Technical Terms from Patent Families of Character-Segmented Chinese Sentences and Morpheme-Segmented Japanese Sentences

Authors

  • Zi Long
  • Takehito Utsuro
  • Tomoharu Mitsuhashi
  • Mikio Yamamoto
Abstract

In manual translation of patent documents, a bilingual lexicon of technical terms is indispensable for translating efficiently. Dong et al. (2015) proposed a method of generating a bilingual technical term lexicon from morpheme-segmented parallel patent sentences. Their method estimates Japanese-Chinese translations of technical terms using the phrase translation table of a statistical machine translation model. The procedure for generating the lexicon consists of four steps: (1) extracting Japanese technical terms from the Japanese side of parallel patent sentences, (2) collecting all the sentences that contain the extracted Japanese term, (3) generating a Chinese translation of the Japanese technical term by referring to the phrase translation table of a statistical machine translation model, and (4) applying Support Vector Machines (SVMs) to the task of identifying bilingual technical terms. In this paper, we segment the Chinese sentences into characters instead of into morphemes as in Dong et al. (2015), and represent Japanese-Chinese patent families in terms of character-segmented Chinese sentences and morpheme-segmented Japanese sentences. We then apply the framework of Dong et al. (2015) for identifying bilingual technical terms to those Japanese-Chinese patent families. As a result, we achieve over 90% precision at a recall of 60% or higher.
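
The character segmentation and the phrase-table lookup of step (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy phrase table, and its probabilities are all hypothetical, and the SVM classification of step (4) is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical toy phrase table: Japanese morpheme sequence ->
# list of (Chinese character sequence, translation probability).
# The entries and probabilities are invented for illustration only.
PhraseTable = Dict[Tuple[str, ...], List[Tuple[Tuple[str, ...], float]]]


def segment_chinese_chars(sentence: str) -> List[str]:
    """Split a Chinese sentence into single characters, dropping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def candidate_translations(
    ja_term: List[str], zh_sentence: str, table: PhraseTable
) -> List[Tuple[str, float]]:
    """Return phrase-table candidates for ja_term that occur in zh_sentence.

    The Chinese side is matched at the character level, so no Chinese
    morphological analyzer is needed.
    """
    zh_chars = segment_chinese_chars(zh_sentence)
    found = []
    for zh_phrase, prob in table.get(tuple(ja_term), []):
        n = len(zh_phrase)
        # Slide a window of n characters over the sentence.
        if any(
            tuple(zh_chars[i:i + n]) == zh_phrase
            for i in range(len(zh_chars) - n + 1)
        ):
            found.append(("".join(zh_phrase), prob))
    # Highest-probability candidate first.
    return sorted(found, key=lambda c: c[1], reverse=True)


toy_table: PhraseTable = {
    ("半導体", "装置"): [
        (("半", "导", "体", "装", "置"), 0.8),
        (("半", "导", "体", "器", "件"), 0.15),
    ]
}

candidates = candidate_translations(
    ["半導体", "装置"], "本发明涉及一种半导体装置。", toy_table
)
```

Here the Japanese term is given as a list of morphemes while the Chinese sentence is split into single characters, mirroring the asymmetric segmentation described in the abstract. Only the candidate that actually occurs in the parallel Chinese sentence is kept.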

Similar resources

Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families

In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper considers situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents and studies the issue of identifying synonymous translation equivalent pairs. First, we collect candidates of synonymous trans...

Compositional Translation of Technical Terms by Integrating Patent Families as a Parallel Corpus and a Comparable Corpus

In previous methods of generating a bilingual lexicon from parallel patent sentences extracted from patent families, the parallel sentences cover only about 30% of the “Background” and “Embodiment” parts, and the remaining 70% or so goes unused. Considering this situation, this paper proposes to generate a bilingual lexicon for technical terms not only from the 30% b...

Evaluating Features for Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families

In the process of translating patent documents, a bilingual lexicon of technical terms is an indispensable knowledge source. It is important to develop techniques for acquiring technical term translation equivalent pairs automatically from parallel patent documents. We take the approach of utilizing the phrase table of a state-of-the-art phrase-based statistical machine translation model. First, we coll...

Translation of Patent Sentences with a Large Vocabulary of Technical Terms Using Neural Machine Translation

Neural machine translation (NMT), a new approach to machine translation, has achieved promising results comparable to those of traditional approaches such as statistical machine translation (SMT). Despite its recent success, NMT cannot handle a larger vocabulary because training complexity and decoding complexity proportionally increase with the number of target words. This problem becomes even...

Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese-Japanese

Unlike European languages, many Asian languages such as Chinese and Japanese have no typographic word boundaries in their writing systems. Word segmentation (tokenization), which breaks sentences down into individual words (tokens), is normally treated as the first step in machine translation (MT). For Chinese and Japanese, different rules and segmentation tools lead to different segmentation results in differe...

Publication date: 2015